2017_NetMF

1. NetMF [2017]

《Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec》

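The traditional paradigm of network mining and network learning usually starts from an explicit exploration of a network's structural properties. However, many such properties (e.g., betweenness centrality, triangle count, modularity) are hand-crafted and require extensive domain knowledge and expensive computation. Given these issues, and the opportunities offered by recent progress in representation learning, learning latent representations of networks (i.e., network embedding) has been widely studied as a way to automatically discover a network's structural properties and map them into a latent space.
Formally, the network embedding problem is usually stated as follows: given an undirected weighted graph $G=(V,E,\mathbf A)$ with vertex set $V$, edge set $E$, and adjacency matrix $\mathbf A$, the goal is to learn a mapping $V\rightarrow \mathbb R^d$ that maps each vertex to a $d$-dimensional vector capturing its structural properties. The resulting representations can serve as input to various network science tasks and network learning algorithms, such as label classification and community detection.
Attempts to solve this problem date back to spectral graph theory and social dimension learning. Recent progress has been largely influenced by the SkipGram model. SkipGram was originally proposed for word embedding: its input is a text corpus made of natural-language sentences, and its output is a latent vector representation for each word in the corpus. Notably, inspired by this setting, DeepWalk pioneered network embedding by treating the vertex paths traversed by random walks on the network as sentences and applying SkipGram to learn latent vertex representations. Following DeepWalk, many network embedding models have been developed, such as LINE, PTE, and node2vec.
So far these models have proven very effective, yet the theoretical mechanisms behind them are poorly understood. It has been noted that SkipGram with negative sampling for word embedding is an implicit factorization of a certain word-context matrix, and there have been recent efforts to explain word embedding models theoretically from a geometric perspective. However, the relationship between the word-context matrix and the network structure remains unclear. In addition, there was an early attempt to analyze DeepWalk's behavior theoretically, but its main theoretical results are not fully consistent with the setting of the original DeepWalk paper. Moreover, despite the superficial similarity among DeepWalk, LINE, PTE, and node2vec, a deeper understanding of their underlying connections is lacking.
In the paper "Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec", the authors provide theoretical results on several SkipGram-based network embedding methods. Specifically: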
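- First, the paper shows that the models mentioned here (DeepWalk, LINE, PTE, node2vec) in theory perform implicit matrix factorization, and it derives the closed form of the matrix for each model. For example, DeepWalk (random walks on a graph plus SkipGram) essentially factorizes a random matrix, and as the length of the random walks goes to infinity this matrix converges in probability to the closed-form matrix.
- Second, from the closed forms of these matrices, the authors observe that, interestingly, LINE can be seen as a special case of DeepWalk in which the SkipGram context window size is $T = 1$. In addition, they show that PTE, as an extension of LINE, is in fact an implicit factorization of a joint matrix of multiple networks.
- Third, the authors discover a theoretical connection between DeepWalk's implicit matrix and the graph Laplacian. Based on this connection, they propose a new algorithm, NetMF, that approximates the closed form of DeepWalk's implicit matrix. By explicitly factorizing this matrix with SVD, extensive experiments on four networks (those used in the DeepWalk and node2vec papers) demonstrate NetMF's strong performance compared with DeepWalk and LINE (relative improvements of up to 50%).
The theoretical value of the paper outweighs the practical value of NetMF: NetMF is hard to apply in industrial settings because both its time complexity and its space complexity are too high.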
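The paper shows that all of the above negative-sampling models can be unified into a matrix factorization framework with closed forms. Its analysis and proofs show that:
- DeepWalk empirically produces a low-rank transformation of the network's normalized Laplacian matrix.
- LINE is, in theory, a special case of DeepWalk in which the vertex context size is 1.
- As an extension of LINE, PTE can be viewed as the joint factorization of the Laplacians of multiple networks.
- node2vec factorizes a matrix related to the stationary distribution and the transition probability tensor of a second-order random walk.
This work lays a theoretical foundation for SkipGram-based network embedding methods and leads to a better understanding of latent network representation learning.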
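Related work: the story of network embedding starts with spectral clustering. Spectral clustering is a data clustering technique that selects eigenvalues/eigenvectors of the data's affinity matrix to obtain a representation, which is then used for clustering or for embedding into a low-dimensional space. It has been widely applied in areas such as community detection and image segmentation.
In recent years, interest in network embedding has grown. Following pioneering work such as SocDim and DeepWalk, a growing body of literature approaches the problem from different angles: heterogeneous network embedding, semi-supervised network embedding, network embedding with rich vertex attributes, with high-order structure, for signed networks, for directed networks, via neural networks, and so on. A common technique in these studies is to define a context for each vertex and then train a predictive model to predict that context. For example, DeepWalk, node2vec, and metapath2vec define vertex contexts through first-order random walks, second-order random walks, and metapath-based random walks, respectively.
The idea of exploiting context information is largely driven by the skip-gram model with negative sampling (SGNS). Recently there have been efforts to understand this model. For example:
- "Neural Word Embedding as Implicit Matrix Factorization" proves that SGNS actually performs implicit matrix factorization, which gives us a tool for analyzing the network embedding models above.
- "A latent variable model approach to pmi-based word embeddings" proposes the generative model RAND-WALK to explain word embedding models.
- "Word embeddings as metric recovery in semantic spaces" frames word embedding as a metric learning problem.
Building on the implicit matrix factorization work, we theoretically analyze popular SkipGram-based network embedding models and connect them to spectral graph theory.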

1.1 Model

Here we first present a detailed theoretical analysis, with proofs, of four popular network embedding methods (LINE, PTE, DeepWalk, node2vec), and then propose our NetMF method.

1.1.1 LINE

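Given an undirected weighted graph $G=(V,E,\mathbf A)$, LINE with second-order proximity (i.e., LINE(2nd)) learns two representation matrices:
- the vertex representation matrix $\mathbf X \in \mathbb R^{|V|\times d}$, whose $i$-th row $\mathbf{\vec x}_i$ is the embedding vector of vertex $v_i$ when it acts as a vertex;
- the context representation matrix $\mathbf Y \in \mathbb R^{|V|\times d}$, whose $i$-th row $\mathbf{\vec y}_i$ is the embedding vector of vertex $v_i$ when it acts as a context.
The objective function of LINE(2nd) is: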
$L = \sum_{i = 1}^{| V |} \sum_{j = 1}^{| V |} A_{i, j} (\log σ ({\vec{x}}_{i} \cdot {\vec{y}}_{j}) + b E_{j^{'} \sim P_{N}} [\log σ (- {\vec{x}}_{i} \cdot {\vec{y}}_{j^{'}})])$
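where $\sigma(\cdot)$ is the sigmoid function, $b$ is the number of negative samples, and $P_N$ is the noise distribution used to generate negative samples. In LINE, the authors use the empirical distribution $P_N(j)\propto d_j^{3/4}$, where $d_j$ is the degree of vertex $j$: $d_j = \sum_{k=1}^{|V|}A_{j,k}$.
The original LINE(2nd) objective is $-\sum_{(i,j)\in E} A_{i,j}\times \log \frac{\exp(\mathbf{\vec x}_i\cdot \mathbf{\vec y}_j)}{\sum_{k=1}^{|V|}\exp(\mathbf{\vec x}_i\cdot \mathbf{\vec y}_k)}$, i.e., $-\sum_{(i,j)\in E} A_{i,j}\times \log p_2(v_j\mid v_i)$. Since $A_{i,j} = 0$ whenever $(i,j)\notin E$, the sum can be extended to all vertex pairs, and applying negative sampling to each softmax term yields $\mathcal L$ above. This is why, in $\mathcal L$, $A_{i,j}$ multiplies not only the "positive edge" term but also the "negative edge" term.
Here we instead choose $P_N(j)\propto d_j$, because this form of the empirical distribution leads to a closed-form solution. Define $\text{vol}_G = \sum_{i=1}^{|V|}\sum_{j=1}^{|V|}A_{i,j}=\sum_{j=1}^{|V|}d_j$ as the sum of all vertex degrees, so that $P_N(j) = \frac{d_j}{\text{vol}_G}$. We rewrite the objective function as: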
$L = (\sum_{i = 1}^{| V |} \sum_{j = 1}^{| V |} A_{i, j} \log σ ({\vec{x}}_{i} \cdot {\vec{y}}_{j})) + (b \sum_{i = 1}^{| V |} d_{i} E_{j^{'} \sim P_{N}} [\log σ (- {\vec{x}}_{i} \cdot {\vec{y}}_{j^{'}})])$
Computing the expectation explicitly over all vertices of $G$ gives:
$E_{j^{'} \sim P_{N}} [\log σ (- {\vec{x}}_{i} \cdot {\vec{y}}_{j^{'}})] = \sum_{j = 1}^{| V |} \frac{d_{j}}{{vol}_{G}} \log σ (- {\vec{x}}_{i} \cdot {\vec{y}}_{j})$
Therefore:
$L = \sum_{i = 1}^{| V |} \sum_{j = 1}^{| V |} (A_{i, j} \log σ ({\vec{x}}_{i} \cdot {\vec{y}}_{j}) + b \frac{d_{i} d_{j}}{{vol}_{G}} \log σ (- {\vec{x}}_{i} \cdot {\vec{y}}_{j}))$
For each vertex pair $(i,j)$, the local objective function is:
$L (i, j) = A_{i, j} \log σ ({\vec{x}}_{i} \cdot {\vec{y}}_{j}) + b \frac{d_{i} d_{j}}{{vol}_{G}} \log σ ({- \vec{x}}_{i} \cdot {\vec{y}}_{j})$
Define $z_{i,j} = \mathbf{\vec x}_i \cdot \mathbf{\vec y}_j$. Following "Neural Word Embedding as Implicit Matrix Factorization", for a sufficiently large embedding dimension the $z_{i,j}$ can be treated as independent of one another. Therefore:
$\frac{\partial L}{\partial z_{i, j}} = \frac{\partial L (i, j)}{\partial z_{i, j}} = A_{i, j} σ (- z_{i, j}) - b \frac{d_{i} d_{j}}{{vol}_{G}} σ (z_{i, j})$
To maximize the objective, we set the partial derivative to zero, which gives:
$\exp (2 z_{i, j}) - (\frac{{vol}_{G} A_{i, j}}{b d_{i} d_{j}} - 1) \exp (z_{i, j}) - \frac{{vol}_{G} A_{i, j}}{b d_{i} d_{j}} = 0$
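This equation, a quadratic in $\exp(z_{i,j})$, has two closed-form solutions:
- $\exp(z_{i,j}) = -1$: the corresponding $z_{i,j}$ is not real, so this solution is discarded.
- $\exp(z_{i,j}) = \frac{\text{vol}_G A_{i,j}}{bd_id_j}$: the valid solution.
Therefore: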
${\vec{x}}_{i} \cdot {\vec{y}}_{j} = z_{i, j} = \log (\frac{{vol}_{G} A_{i, j}}{b d_{i} d_{j}})$
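The step from the vanishing derivative to the quadratic can be spelled out. As a sketch, write $k = \frac{\text{vol}_G A_{i,j}}{b d_i d_j}$ (a shorthand introduced here for this check only, assuming $A_{i,j} \gt 0$) and use $\sigma(z) = \frac{e^{z}}{1+e^{z}}$ and $\sigma(-z) = \frac{1}{1+e^{z}}$:
$A_{i,j}\, σ(- z_{i,j}) = b \frac{d_{i} d_{j}}{{vol}_{G}}\, σ(z_{i,j}) \;\Longleftrightarrow\; e^{z_{i,j}} = k$
$(e^{z_{i,j}} - k)(e^{z_{i,j}} + 1) = e^{2 z_{i,j}} - (k - 1) e^{z_{i,j}} - k$
So the quadratic is the product of the factor $e^{z_{i,j}} - k$, which encodes the stationarity condition, and the factor $e^{z_{i,j}} + 1$, which never vanishes for real $z_{i,j}$.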
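Thus LINE(2nd) corresponds to the matrix factorization:
$X Y^{⊤} = \log ({vol}_{G} D^{- 1} A D^{- 1}) - \log b$
where $\mathbf D = \text{diag}(d_1,\cdots,d_{|V|})$, and both the logarithm and the subtraction of $\log b$ are applied element-wise.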
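The closed form can be checked numerically on a small graph. The following is a minimal sketch, assuming a dense NumPy adjacency matrix; the function names and the toy triangle graph are illustrative choices, not part of the original notes. It builds the matrix entry by entry and confirms that the derivative of the local objective vanishes at the closed-form value.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def line_implicit_matrix(A, b):
    """Entries of log(vol_G * D^-1 A D^-1) - log b for a dense adjacency matrix A."""
    d = A.sum(axis=1)            # degrees d_i
    vol = d.sum()                # vol_G
    with np.errstate(divide="ignore"):
        # Non-edges have A_ij = 0, so their entries evaluate to -inf.
        return np.log(vol * A / np.outer(d, d)) - np.log(b)


def local_gradient(A, b, i, j, z):
    """Derivative of the local objective L(i, j) with respect to z_ij."""
    d = A.sum(axis=1)
    vol = d.sum()
    return A[i, j] * sigmoid(-z) - b * d[i] * d[j] / vol * sigmoid(z)


# Hypothetical toy graph: an unweighted triangle.
A = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
b = 2.0
Z = line_implicit_matrix(A, b)
grad = local_gradient(A, b, 0, 1, Z[0, 1])   # should vanish at the optimum
```

Entries for non-adjacent pairs (including the diagonal here) are $-\infty$, which is one reason the NetMF section later truncates the matrix before taking the logarithm.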

1.1.2 PTE

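PTE is an extension of LINE(2nd) to heterogeneous text networks. We first extend the analysis of LINE(2nd) to bipartite networks. Consider a bipartite graph $G=(V_1\cup V_2, E, \mathbf A)$, where $V_1$ and $V_2$ are two disjoint vertex sets, $E\subseteq V_1\times V_2$ is the edge set, and $\mathbf A\in \mathbb R^{|V_1|\times |V_2|}$ is the bipartite adjacency matrix. The volume of $G$ is defined as $\text{vol}_G = \sum_{i=1}^{|V_1|}\sum_{j=1}^{|V_2|} A_{i,j}$. The goal of PTE is to learn a representation $\mathbf{\vec x}_i$ for each vertex $v_i\in V_1$ and a representation $\mathbf{\vec y}_j$ for each vertex $v_j\in V_2$. The LINE(2nd) objective then becomes: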
$L = \sum_{i = 1}^{| V_{1} |} \sum_{j = 1}^{| V_{2} |} A_{i, j} (\log σ ({\vec{x}}_{i} \cdot {\vec{y}}_{j}) + b E_{j^{'} \sim P_{N}} [\log σ (- {\vec{x}}_{i} \cdot {\vec{y}}_{j^{'}})])$
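Here $V_1$ and $V_2$ play different roles: vertices in $V_1$ use the vertex embedding variables, and vertices in $V_2$ use the context embedding variables.
As in the analysis of LINE, maximizing $\mathcal L$ is equivalent to the factorization:
$X Y^{⊤} = \log ({vol}_{G} D_{row}^{- 1} A D_{col}^{- 1}) - \log b$
where $\mathbf D_\text{row} = \text{diag}(\mathbf A\mathbf{\vec e})$ is the diagonal matrix of the row sums of $\mathbf A$, $\mathbf D_\text{col} = \text{diag}(\mathbf A^\top\mathbf{\vec e})$ is the diagonal matrix of the column sums of $\mathbf A$, and $\mathbf{\vec e}$ is the all-ones vector. The logarithm and the subtraction of $\log b$ are element-wise.
When $\mathbf A = \mathbf A^\top$, we have $\mathbf D_\text{row} = \mathbf D_\text{col}$.
Following PTE, given the word set $\mathbb V$, the document set $\mathbb D$, and the label set $\mathbb L$, we split the text network into three sub-networks: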
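- The word-word sub-network $G^{w,w}$: each word is a vertex, and the edge weight is the number of times two words co-occur within a window of size $T$. Let $\mathbf A^{w,w}$ be its adjacency matrix, $d_{i,\cdot}^{w,w} = \sum_j A_{i,j}^{w,w}$ the sum of row $i$ of $\mathbf A^{w,w}$, and $d_{\cdot,j}^{w,w} = \sum_i A_{i,j}^{w,w}$ the sum of column $j$ of $\mathbf A^{w,w}$. The diagonal matrices $\mathbf D_\text{row}^{w,w} = \text{diag}(d^{w,w}_{1,\cdot},\cdots,d^{w,w}_{|\mathbb V|,\cdot})$ and $\mathbf D_\text{col}^{w,w} = \text{diag}(d^{w,w}_{\cdot,1},\cdots,d^{w,w}_{\cdot,|\mathbb V|})$ consist of the row sums and the column sums of $\mathbf A^{w,w}$, respectively.
- The word-document sub-network $G^{w,d}$: each word and each document is a vertex, and the edge weight is the number of times the word appears in the document. The diagonal matrices $\mathbf D_\text{row}^{w,d} = \text{diag}(d^{w,d}_{1,\cdot},\cdots,d^{w,d}_{|\mathbb V|,\cdot})$ and $\mathbf D_\text{col}^{w,d} = \text{diag}(d^{w,d}_{\cdot,1},\cdots,d^{w,d}_{\cdot,|\mathbb D|})$ consist of the row sums and the column sums of $\mathbf A^{w,d}$, respectively.
- The word-label sub-network $G^{w,l}$: each word and each label is a vertex, and the edge weight is the number of times the word appears in documents carrying that label. The diagonal matrices $\mathbf D_\text{row}^{w,l} = \text{diag}(d^{w,l}_{1,\cdot},\cdots,d^{w,l}_{|\mathbb V|,\cdot})$ and $\mathbf D_\text{col}^{w,l} = \text{diag}(d^{w,l}_{\cdot,1},\cdots,d^{w,l}_{\cdot,|\mathbb L|})$ consist of the row sums and the column sums of $\mathbf A^{w,l}$, respectively.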
The objective function of PTE is:
$\begin{matrix} L = α L_{w, w} + β L_{w, d} + γ L_{w, l} = \\ α \sum_{i = 1}^{| \mathbb V |} \sum_{j = 1}^{| \mathbb V |} (A_{i, j}^{w, w} \log σ ({\vec{x}}_{i}^{w, w} \cdot {\vec{y}}_{j}^{w, w}) + b \frac{d_{i, \cdot}^{w, w} d_{\cdot, j}^{w, w}}{{vol}_{G^{w, w}}} \log σ (- {\vec{x}}_{i}^{w, w} \cdot {\vec{y}}_{j}^{w, w})) \\ + β \sum_{i = 1}^{| \mathbb V |} \sum_{j = 1}^{| \mathbb D |} (A_{i, j}^{w, d} \log σ ({\vec{x}}_{i}^{w, d} \cdot {\vec{y}}_{j}^{w, d}) + b \frac{d_{i, \cdot}^{w, d} d_{\cdot, j}^{w, d}}{{vol}_{G^{w, d}}} \log σ (- {\vec{x}}_{i}^{w, d} \cdot {\vec{y}}_{j}^{w, d})) \\ + γ \sum_{i = 1}^{| \mathbb V |} \sum_{j = 1}^{| \mathbb L |} (A_{i, j}^{w, l} \log σ ({\vec{x}}_{i}^{w, l} \cdot {\vec{y}}_{j}^{w, l}) + b \frac{d_{i, \cdot}^{w, l} d_{\cdot, j}^{w, l}}{{vol}_{G^{w, l}}} \log σ (- {\vec{x}}_{i}^{w, l} \cdot {\vec{y}}_{j}^{w, l})) \end{matrix}$
where the superscripts $(\cdot)^{w,w},(\cdot)^{w,d},(\cdot)^{w,l}$ distinguish the three sub-networks, $\alpha,\beta,\gamma$ are three non-negative hyperparameters that balance the sub-networks, and $b$ is the number of negative samples. According to PTE, $\alpha,\beta,\gamma$ must satisfy $\alpha \text{vol}_{G^{w,w}} = \beta \text{vol}_{G^{w,d}} = \gamma \text{vol}_{G^{w,l}}$, because PTE performs edge sampling during training, with edges sampled alternately from the three sub-networks.
By the earlier result, we have:
$\begin{matrix} {\vec{x}}_{i}^{w, w} \cdot {\vec{y}}_{j}^{w, w} = \log (\frac{{vol}_{G^{w, w}} A_{i, j}^{w, w}}{b d_{i, \cdot}^{w, w} d_{\cdot, j}^{w, w}}) \\ {\vec{x}}_{i}^{w, d} \cdot {\vec{y}}_{j}^{w, d} = \log (\frac{{vol}_{G^{w, d}} A_{i, j}^{w, d}}{b d_{i, \cdot}^{w, d} d_{\cdot, j}^{w, d}}) \\ {\vec{x}}_{i}^{w, l} \cdot {\vec{y}}_{j}^{w, l} = \log (\frac{{vol}_{G^{w, l}} A_{i, j}^{w, l}}{b d_{i, \cdot}^{w, l} d_{\cdot, j}^{w, l}}) \end{matrix}$
Let:
$\begin{matrix} X = [\begin{matrix} ({\vec{x}}_{1}^{w, w})^{⊤} & 0 & 0 \\ ⋮ & ⋮ & ⋮ \\ ({\vec{x}}_{| V |}^{w, w})^{⊤} & 0 & 0 \\ 0 & ({\vec{x}}_{1}^{w, d})^{⊤} & 0 \\ ⋮ & ⋮ & ⋮ \\ 0 & ({\vec{x}}_{| V |}^{w, d})^{⊤} & 0 \\ 0 & 0 & ({\vec{x}}_{1}^{w, l})^{⊤} \\ ⋮ & ⋮ & ⋮ \\ 0 & 0 & ({\vec{x}}_{| V |}^{w, l})^{⊤} \end{matrix}] Y = [\begin{matrix} ({\vec{y}}_{1}^{w, w})^{⊤} & 0 & 0 \\ ⋮ & ⋮ & ⋮ \\ ({\vec{y}}_{| V |}^{w, w})^{⊤} & 0 & 0 \\ 0 & ({\vec{y}}_{1}^{w, d})^{⊤} & 0 \\ ⋮ & ⋮ & ⋮ \\ 0 & ({\vec{y}}_{| D |}^{w, d})^{⊤} & 0 \\ 0 & 0 & ({\vec{y}}_{1}^{w, l})^{⊤} \\ ⋮ & ⋮ & ⋮ \\ 0 & 0 & ({\vec{y}}_{| L |}^{w, l})^{⊤} \end{matrix}] \end{matrix}$
Then $\mathbf X \in \mathbb R^{3|\mathbb V|\times 3d},\mathbf Y \in \mathbb R^{(|\mathbb V|+|\mathbb D| +|\mathbb L|)\times 3d}$, and:
$\begin{matrix} M_{1} = α {vol}_{G^{w, w}} (D_{row}^{w, w})^{- 1} A^{w, w} (D_{col}^{w, w})^{- 1} \\ M_{2} = β {vol}_{G^{w, d}} (D_{row}^{w, d})^{- 1} A^{w, d} (D_{col}^{w, d})^{- 1} \\ M_{3} = γ {vol}_{G^{w, l}} (D_{row}^{w, l})^{- 1} A^{w, l} (D_{col}^{w, l})^{- 1} \\ X Y^{⊤} = \log ([\begin{matrix} M_{1} & 0 & 0 \\ 0 & M_{2} & 0 \\ 0 & 0 & M_{3} \end{matrix}]) - \log b \end{matrix}$
where the logarithm and the subtraction of $\log b$ are element-wise.

1.1.3 DeepWalk

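DeepWalk first generates a "corpus" $\mathcal D$ by running random walks on the graph, and then trains a SkipGram model on $\mathcal D$. Here we focus on SkipGram with negative sampling (SGNS). The overall algorithm is as follows:
- Input:
  - the graph $G(V,E,\mathbf A)$
  - the window size $T$
  - the random walk length $L$
  - the number of random walks $N$
- Output: the vertex embedding matrix $\mathbf X$
- Steps:
  - $\text{for}\; s = 1,2,\cdots,N$, iterate:
    - Sample an initial vertex $w^{<s>}_1$ from the prior distribution $P(w)$.
    - Generate a random walk of length $L$ on $G$ starting from $w^{<s>}_1$: $(w_1^{<s>},\cdots,w_L^{<s>})$.
    - For each $j=1,2,\cdots,L-T$:
      - For each $r=1,2,\cdots,T$:
        Add the vertex-context pair $(w_j^{<s>},w_{j+r}^{<s>})$ to $\mathcal D$.
        Add the vertex-context pair $(w_{j+r} ^{<s>},w_{j}^{<s>})$ to $\mathcal D$.
  - Run SGNS on $\mathcal D$ with $b$ negative samples.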
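The corpus-generation loop above can be sketched as follows. This is an illustrative sketch, not DeepWalk's reference implementation: it assumes an unweighted undirected graph stored as a hypothetical adjacency-list dictionary, samples the start vertex with probability proportional to its degree, and returns the multiset $\mathcal D$ as a counter of vertex-context pairs. The SGNS training step is omitted.

```python
import random
from collections import Counter


def deepwalk_corpus(adj, T, L, N, seed=0):
    """Collect vertex-context pairs from N random walks of length L with window T.

    adj: dict mapping each vertex to the list of its neighbours
         (unweighted, undirected; an assumed input format).
    Returns a Counter mapping (vertex, context) -> number of occurrences n(w, c).
    """
    rng = random.Random(seed)
    vertices = list(adj)
    degrees = [len(adj[v]) for v in vertices]
    corpus = Counter()
    for _ in range(N):
        # Sample the first vertex from P(w) proportional to the degree d_w.
        walk = [rng.choices(vertices, weights=degrees, k=1)[0]]
        for _ in range(L - 1):
            walk.append(rng.choice(adj[walk[-1]]))
        # Positions j = 1..L-T in the text correspond to indices 0..L-T-1 here.
        for j in range(L - T):
            for r in range(1, T + 1):
                corpus[(walk[j], walk[j + r])] += 1
                corpus[(walk[j + r], walk[j])] += 1
    return corpus


# Hypothetical toy graph: a triangle 0-1-2 with a pendant vertex 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
corpus = deepwalk_corpus(adj, T=3, L=50, N=4, seed=0)
total_pairs = sum(corpus.values())
```

Each walk contributes $2T(L-T)$ pairs, so $|\mathcal D| = 2T(L-T)N$, and the counts are symmetric because every pair is added in both directions.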
According to the paper "Neural Word Embedding as Implicit Matrix Factorization", SGNS is equivalent to an implicit matrix factorization:
$\begin{matrix} X Y^{⊤} = M \\ M_{w, c} = \log (\frac{n (w, c) \times | D |}{n (w) \times n (c)}) - \log b \end{matrix}$
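where $|\mathcal D|$ is the size of the corpus, $n(w,c)$ is the number of times the vertex-context pair $(w,c)$ occurs in $\mathcal D$, $n(w)$ is the number of times vertex $w$ occurs, $n(c)$ is the number of times context $c$ occurs, and $b$ is the number of negative samples.
What property of the graph structure does $\mathbf M$ capture? No prior analysis answers this question. Nor is there a theoretical explanation of whether the $\log$ or the $\log b$ term can be dropped.
The following analysis relies on some key assumptions:
- $G$ is undirected, connected, and non-bipartite, so that $P(w) = d_w/\text{vol}_G$ is a stationary distribution of the random walk.
- The first vertex of each random walk is drawn from the stationary distribution $P(w)$.
For $r=1,2,\cdots,T$, define:
- $\mathcal D_{r\rightarrow} = \{(w,c)\mid (w,c)\in \mathcal D, w = w_j^{<s>},c = w_{j+r}^{<s>}\}$: the sub-multiset of $\mathcal D$ in which context $c$ appears exactly $r$ steps after vertex $w$ in the walk. Let $n(w,c)_{r\rightarrow}$ be the number of times the vertex-context pair $(w,c)$ occurs in $\mathcal D_{r\rightarrow}$.
- $\mathcal D_{r\leftarrow} = \{(w,c)\mid (w,c)\in \mathcal D, w = w_{j+r}^{<s>},c = w_{j}^{<s>}\}$: the sub-multiset of $\mathcal D$ in which context $c$ appears exactly $r$ steps before vertex $w$ in the walk. Let $n(w,c)_{r\leftarrow}$ be the number of times the vertex-context pair $(w,c)$ occurs in $\mathcal D_{r\leftarrow}$.
- $\mathbf P = \mathbf D^{-1}\mathbf A$ is the transition matrix and $\mathbf P^r = \underbrace {\mathbf P\times \cdots \times \mathbf P}_r$ is the $r$-th power of $\mathbf P$, where $\mathbf D = \text{diag}(d_1,\cdots,d_{|V|})$, $d_w = \sum_{c}A_{w,c}$, and $\text{vol}_G=\sum_{w}\sum_{c}A_{w,c}$. Let $P^r_{w,c}$ be the entry of $\mathbf P^r$ in row $w$ and column $c$.
  The entry $P_{w,c}$ of $\mathbf P$ is the probability of moving from $w$ to $c$ in one step, and $P^r_{w,c}$ is the probability of moving from $w$ to $c$ in exactly $r$ steps.
Since the graph is undirected, $\mathbf A= \mathbf A^{\top}$.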
Theorem 1: as $L\rightarrow \infty$, we have:
$\begin{matrix} \frac{n (w, c)_{r \to}}{| D_{r \to} |} \overset{p}{\to} \frac{d_{w}}{{vol}_{G}} P_{w, c}^{r} \\ \frac{n (w, c)_{r \leftarrow}}{| D_{r \leftarrow} |} \overset{p}{\to} \frac{d_{c}}{{vol}_{G}} P_{c, w}^{r} \end{matrix}$
where $\stackrel{p}{\rightarrow}$ denotes convergence in probability.
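Proof:
First we recall the S.N. Bernstein law of large numbers: let $Y_1,Y_2,\cdots$ be a sequence of random variables with bounded expectation $\mathbb E[Y_j]\lt K$ and bounded variance $\text{Var} (Y_j) \lt K$, such that $\text{Cov}(Y_i,Y_j)\rightarrow 0$ as $|i-j|\rightarrow \infty$. Then the law of large numbers (LLN) holds.
Consider first the case $N=1$, and let the random walk be $(w_1,\cdots,w_L)$. Given a vertex-context pair $(w,c)$, let $Y_j,j=1,2,\cdots,L-T$ be the indicator of the event $w_j=w,w_{j+r} = c$.
We observe that:
- $|\mathcal D_{r\rightarrow}| = L-T$ and $\sum_{j=1}^{L-T} Y_j = n(w,c)_{r\rightarrow}$. Hence:
  $\frac{n (w, c)_{r \to}}{| D_{r \to} |} = \frac{1}{L - T} \sum_{j = 1}^{L - T} Y_{j}$
- By the assumptions above, the expectation of $Y_j$ is the probability that $w_j=w$ (given by the stationary distribution) times the probability of reaching $c$ from $w$ in exactly $r$ steps. That is:
  $E [Y_{j}] = P (Y_{j} = 1) = \frac{d_{w}}{{vol}_{G}} \times P_{w, c}^{r}$
- For $j\gt i+r$ we have: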
  $\begin{matrix} E [Y_{i} Y_{j}] = P (w_{i} = w, w_{i + r} = c, w_{j} = w, w_{j + r} = c) \\ = \frac{d_{w}}{{vol}_{G}} P_{w, c}^{r} \times P_{c, w}^{j - i - r} \times P_{w, c}^{r} \end{matrix}$
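  This is the product of the probability that $w_i$ equals $w$, the probability of going from $w$ to $c$ in $r$ steps, the probability of going from $c$ to $w$ in $j-(r+i)$ steps, and the probability of going from $w$ to $c$ in $r$ steps again.
Then: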
$Cov (Y_{i}, Y_{j}) = E [Y_{i} Y_{j}] - E [Y_{i}] E [Y_{j}] = \frac{d_{w}}{{vol}_{G}} P_{w, c}^{r} [P_{c, w}^{j - i - r} - \frac{d_{w}}{{vol}_{G}}] P_{w, c}^{r}$
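As $|j-i|\rightarrow \infty$, the probability of going from $c$ to $w$ in $j-i-r$ steps converges to the stationary probability $p(w_j=w)$. That is:
$\lim_{| j - i | \to \infty} P_{c, w}^{j - i - r} = \frac{d_{w}}{{vol}_{G}}$
Hence $\lim_{|j- i|\rightarrow \infty} \text{Cov}(Y_i,Y_j) = 0$, which reflects the convergence of the random walk to its stationary distribution.
Applying the law of large numbers, we get: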
$\frac{n (w, c)_{r \to}}{| D_{r \to} |} = \frac{1}{L - T} \sum_{j = 1}^{L - T} Y_{j} \overset{p}{\to} \frac{1}{L - T} \sum_{j = 1}^{L - T} E [Y_{j}] = \frac{d_{w}}{{vol}_{G}} P_{w, c}^{r}$
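Similarly, we have:
$\frac{n (w, c)_{r \leftarrow}}{| D_{r \leftarrow} |} \overset{p}{\to} \frac{d_{c}}{{vol}_{G}} P_{c, w}^{r}$
For $N\gt 1$, define $Y_j^{<s>},s=1,2,\cdots,N,j=1,2,\cdots,L-T$ as the indicator of the event $w_j^{<s>} = w, w_{j+r}^{<s>} = c$; the same conclusion follows in the same way.
If the initial vertex is not drawn from the stationary distribution, the walk still converges to it, so as $j\rightarrow \infty$ we have:
$\lim_{j \to \infty} P (w_{j} = w, w_{j + r} = c) = \frac{d_{w}}{{vol}_{G}} P_{w, c}^{r}$
Therefore Theorem 1 still holds.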
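Theorem 1 can be illustrated empirically by simulating one long walk and comparing the pair frequencies with the limit $\frac{d_w}{\text{vol}_G}P^r_{w,c}$. The sketch below assumes a small dense adjacency matrix and a hypothetical helper name; for simplicity it counts all $L-r$ positions of the walk rather than only the first $L-T$, which does not change the limit.

```python
import numpy as np


def empirical_vs_limit(A, r, L, seed=0):
    """Compare n(w,c)_{r->} / |D_{r->}| from one walk with d_w / vol_G * P^r[w, c]."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    d = A.sum(axis=1)
    vol = d.sum()
    P = A / d[:, None]                       # transition matrix D^-1 A
    walk = np.empty(L, dtype=int)
    walk[0] = rng.choice(n, p=d / vol)       # start from the stationary distribution
    for t in range(1, L):
        walk[t] = rng.choice(n, p=P[walk[t - 1]])
    counts = np.zeros((n, n))
    # Count pairs (w_j, w_{j+r}) over all valid positions j.
    np.add.at(counts, (walk[:L - r], walk[r:]), 1)
    empirical = counts / (L - r)
    limit = (d / vol)[:, None] * np.linalg.matrix_power(P, r)
    return empirical, limit


# Hypothetical connected, non-bipartite toy graph.
A = np.array([[0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0]])
empirical, limit = empirical_vs_limit(A, r=2, L=20000, seed=0)
max_gap = np.abs(empirical - limit).max()
```

With a walk of this length the largest entry-wise gap is small, and it shrinks further as $L$ grows, in line with convergence in probability.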
Theorem 2: as $L\rightarrow\infty$, we have:
$\frac{n (w, c)}{| D |} \overset{p}{\to} \frac{1}{2 T} \sum_{r = 1}^{T} (\frac{d_{w}}{{vol}_{G}} P_{w, c}^{r} + \frac{d_{c}}{{vol}_{G}} P_{c, w}^{r})$
Proof:
Note that $\frac{|\mathcal D_{r\rightarrow}|}{|\mathcal D|}=\frac{|\mathcal D_{r\leftarrow}|}{|\mathcal D|} = \frac{1}{2T}$. Applying Theorem 1 gives:
$\begin{matrix} \frac{n (w, c)}{| D |} = \frac{\sum_{r = 1}^{T} (n (w, c)_{r \to} + n (w, c)_{r \leftarrow})}{\sum_{r = 1}^{T} (| D_{r \to} | + | D_{r \leftarrow} |)} = \frac{1}{2 T} \sum_{r = 1}^{T} (\frac{n (w, c)_{r \to}}{| D_{r \to} |} + \frac{n (w, c)_{r \leftarrow}}{| D_{r \leftarrow} |}) \\ \overset{p}{\to} \frac{1}{2 T} \sum_{r = 1}^{T} (\frac{d_{w}}{{vol}_{G}} P_{w, c}^{r} + \frac{d_{c}}{{vol}_{G}} P_{c, w}^{r}) \end{matrix}$
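Furthermore, for the marginal counts of $w$ and $c$, as $L\rightarrow \infty$ we have:
$\frac{n (w)}{| D |} \overset{p}{\to} \frac{d_{w}}{{vol}_{G}}, \frac{n (c)}{| D |} \overset{p}{\to} \frac{d_{c}}{{vol}_{G}}$
This uses the fact that, for each $r$ and each direction, the vertex-context pairs at distance $r$ make up a fraction $1/(2T)$ of all vertex-context pairs.
Theorem 3: for DeepWalk, as $L\rightarrow \infty$ we have:
$\frac{n (w, c) | D |}{n (w) n (c)} \overset{p}{\to} \frac{{vol}_{G}}{2 T} (\frac{1}{d_{c}} \sum_{r = 1}^{T} P_{w, c}^{r} + \frac{1}{d_{w}} \sum_{r = 1}^{T} P_{c, w}^{r})$
Therefore DeepWalk is equivalent to the factorization:
$X Y^{⊤} = \log (\frac{{vol}_{G}}{T} (\sum_{r = 1}^{T} P^{r}) D^{- 1}) - \log b$
where the logarithm and the subtraction of $\log b$ are element-wise.
Proof:
By Theorem 2 and the continuous mapping theorem, we have: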
$\begin{matrix} \frac{n (w, c) | D |}{n (w) n (c)} = \frac{\frac{n (w, c)}{| D |}}{\frac{n (w)}{| D |} \frac{n (c)}{| D |}} \overset{p}{\to} \frac{\frac{1}{2 T} \sum_{r = 1}^{T} (\frac{d_{w}}{{vol}_{G}} P_{w, c}^{r} + \frac{d_{c}}{{vol}_{G}} P_{c, w}^{r})}{\frac{d_{w}}{{vol}_{G}} \times \frac{d_{c}}{{vol}_{G}}} \\ = \frac{{vol}_{G}}{2 T} (\frac{1}{d_{c}} \sum_{r = 1}^{T} P_{w, c}^{r} + \frac{1}{d_{w}} \sum_{r = 1}^{T} P_{c, w}^{r}) \end{matrix}$
In matrix form:
$\begin{matrix} \frac{{vol}_{G}}{2 T} (\sum_{r = 1}^{T} P^{r} D^{- 1} + \sum_{r = 1}^{T} D^{- 1} (P^{r})^{⊤}) \\ = \frac{{vol}_{G}}{2 T} (\sum_{r = 1}^{T} (\underbrace{D^{- 1} A \times \dots \times D^{- 1} A}_{r}) D^{- 1} + \sum_{r = 1}^{T} D^{- 1} (\underbrace{A D^{- 1} \times \dots \times A D^{- 1}}_{r})) \\ = \frac{{vol}_{G}}{T} (\sum_{r = 1}^{T} (\underbrace{D^{- 1} A \times \dots \times D^{- 1} A}_{r}) D^{- 1}) = \frac{{vol}_{G}}{T} (\sum_{r = 1}^{T} P^{r}) D^{- 1} \end{matrix}$
When $T=1$, this matrix reduces to the one factorized by LINE(2nd), so LINE(2nd) is a special case of DeepWalk.
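The special-case relationship can be checked numerically. The sketch below, with illustrative function names and a hypothetical toy graph, computes the matrix inside the logarithm, $\frac{\text{vol}_G}{bT}\left(\sum_{r=1}^T \mathbf P^r\right)\mathbf D^{-1}$ (with $\log b$ folded in as a factor $1/b$), and compares the $T=1$ case with the LINE(2nd) matrix $\frac{\text{vol}_G}{b}\mathbf D^{-1}\mathbf A\mathbf D^{-1}$.

```python
import numpy as np


def deepwalk_matrix(A, T, b=1.0):
    """vol_G / (b T) * (sum_{r=1..T} P^r) D^-1, i.e. the matrix inside the log."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    vol = d.sum()
    P = A / d[:, None]                      # P = D^-1 A
    P_r = np.eye(A.shape[0])
    S = np.zeros_like(A)
    for _ in range(T):
        P_r = P_r @ P                       # P^r
        S += P_r
    return vol / (b * T) * S / d[None, :]   # right-multiply by D^-1


def line_matrix(A, b=1.0):
    """vol_G / b * D^-1 A D^-1, the matrix inside the log for LINE(2nd)."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    return d.sum() * A / np.outer(d, d) / b


# Hypothetical connected, non-bipartite toy graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]])
M1 = deepwalk_matrix(A, T=1, b=5.0)
M10 = deepwalk_matrix(A, T=10, b=5.0)
```

For an undirected graph the result is symmetric for every $T$, which matches the step in the derivation where the two sums are merged into one.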

1.1.4 node2vec

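node2vec is a recently proposed graph embedding method. Its algorithm is as follows:
- Input:
  - the graph $G(V,E,\mathbf A)$
  - the window size $T$
  - the random walk length $L$
  - the number of random walks $N$
- Output: the vertex embeddings
- Steps:
  - Construct the transition probability tensor $\mathbf P\in \mathbb R^{|V| \times |V|\times |V|}$.
  - $\text{for}\; s = 1,2,\cdots,N$, iterate:
    - Sample the first two vertices $(w^{<s>}_1,w_2^{<s>})$ from the prior distribution $Q(w_1,w_2)$.
    - Generate a second-order random walk of length $L$ on $G$ starting from $w^{<s>}_1,w^{<s>}_2$: $(w_1^{<s>},\cdots,w_L^{<s>})$.
    - For each $j=2,\cdots,L-T$:
      - For each $r=1,2,\cdots,T$:
        Add the triple $(w_j^{<s>},w_{j+r}^{<s>},w_{j-1}^{<s>})$ to $\mathcal D$.
        Add the triple $(w_{j+r} ^{<s>},w_{j}^{<s>},w_{j-1}^{<s>})$ to $\mathcal D$.
  - Run SGNS with $b$ negative samples on $\mathcal D^\prime=\{(w,c)\mid (w,c,u)\in \mathcal D \}$.
  Note that $\mathcal D$ here stores triples $(w_j,w_{j+r},w_{j-1})$ rather than vertex-context pairs.
In node2vec, the transition probability tensor $\mathbf P$ is defined as follows: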
- First define the unnormalized probability:
  ${\hat{P}}_{w, v, u} = \begin{cases} \frac{1}{p}, & (u, v) \in E, (v, w) \in E, u = w \\ 1, & (u, v) \in E, (v, w) \in E, u \neq w, (w, u) \in E \\ \frac{1}{q}, & (u, v) \in E, (v, w) \in E, u \neq w, (w, u) \notin E \\ 0, & \text{otherwise} \end{cases}$
  where $\hat P_{w,v,u}$ is the unnormalized probability that $w_{j+1} = w$ given $w_{j-1}=u,w_j = v$.
- Then normalize over the next vertex to obtain the transition probability:
  $P_{w, v, u} = P (w_{j + 1} = w \mid w_{j} = v, w_{j - 1} = u) = \frac{{\hat{P}}_{w, v, u}}{\sum_{w^{'}} {\hat{P}}_{w^{'}, v, u}}$
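The two steps above can be sketched for a small unweighted graph as follows. This is an illustrative sketch rather than node2vec's reference implementation: the function name is hypothetical, the tensor is stored densely (which only works for tiny graphs), and normalization is taken over the candidate next vertex $w$ so that each conditional distribution sums to one.

```python
import numpy as np


def node2vec_transition(A, p, q):
    """Return P[w, v, u] = Prob(next = w | current = v, previous = u)."""
    n = A.shape[0]
    E = A > 0                                  # edge indicator
    P_hat = np.zeros((n, n, n))
    for u in range(n):
        for v in range(n):
            if not E[u, v]:
                continue                       # (u, v) must be an edge
            for w in range(n):
                if not E[v, w]:
                    continue                   # (v, w) must be an edge
                if w == u:
                    P_hat[w, v, u] = 1.0 / p   # return to the previous vertex
                elif E[w, u]:
                    P_hat[w, v, u] = 1.0       # w is also a neighbour of u
                else:
                    P_hat[w, v, u] = 1.0 / q   # move away from u
    # Normalize over the next vertex w; pairs (v, u) that are not edges stay zero.
    norm = P_hat.sum(axis=0, keepdims=True)
    return np.divide(P_hat, norm, out=np.zeros_like(P_hat), where=norm > 0)


# Hypothetical toy graph: the path 0 - 1 - 2.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
P = node2vec_transition(A, p=2.0, q=0.5)
column_sums = P.sum(axis=0)                    # indexed by (v, u)
```

On the path graph with $p=2, q=0.5$, a walker that arrived at vertex 1 from vertex 0 returns to 0 with unnormalized weight $1/p = 0.5$ and moves on to 2 with weight $1/q = 2$, i.e. with probabilities 0.2 and 0.8.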
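As with DeepWalk, we define:
$\begin{matrix} D_{r \to} = \{(w, c, u) \mid (w, c, u) \in D, w_{j}^{<s>} = w, w_{j + r}^{<s>} = c, w_{j - 1}^{<s>} = u\} \\ D_{r \leftarrow} = \{(w, c, u) \mid (w, c, u) \in D, w_{j + r}^{<s>} = w, w_{j}^{<s>} = c, w_{j - 1}^{<s>} = u\} \end{matrix}$
where $u$ is the previous vertex.
Let $n(w,c,u)_{r\rightarrow}$ be the number of times the triple $(w,c,u)$ occurs in $\mathcal D_{r\rightarrow}$, and $n(w,c,u)_{r\leftarrow}$ the number of times $(w,c,u)$ occurs in $\mathcal D_{r\leftarrow}$.
Let $\mathbf Q$ be the stationary distribution of the second-order random walk, i.e., $\sum_{u} P_{w,v,u}Q_{v,u} = Q_{w,v}$; by the Perron-Frobenius theorem, such a $\mathbf Q$ exists.
Define the higher-order transition probability $P^r_{w,v,u} = P(w_{j+r}=w\mid w_j=v,w_{j-1}=u)$.
For brevity, we only state the main results for node2vec; the proofs are similar to those for DeepWalk: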
$\begin{matrix} \frac{n (w, c, u)_{r \to}}{| D_{r \to} |} \overset{p}{\to} Q_{w, u} P_{c, w, u}^{r}, \frac{n (w, c, u)_{r \leftarrow}}{| D_{r \leftarrow} |} \overset{p}{\to} Q_{c, u} P_{w, c, u}^{r} \\ \frac{n (w, c)_{r \to}}{| D_{r \to} |} = \frac{\sum_{u} n (w, c, u)_{r \to}}{| D_{r \to} |} \overset{p}{\to} \sum_{u} Q_{w, u} P_{c, w, u}^{r} \\ \frac{n (w, c)_{r \leftarrow}}{| D_{r \leftarrow} |} = \frac{\sum_{u} n (w, c, u)_{r \leftarrow}}{| D_{r \leftarrow} |} \overset{p}{\to} \sum_{u} Q_{c, u} P_{w, c, u}^{r} \\ \frac{n (w, c)}{| D |} \overset{p}{\to} \frac{1}{2 T} \sum_{r = 1}^{T} (\sum_{u} Q_{w, u} P_{c, w, u}^{r} + \sum_{u} Q_{c, u} P_{w, c, u}^{r}) \\ \frac{n (w)}{| D |} \overset{p}{\to} \sum_{u} Q_{w, u}, \frac{n (c)}{| D |} \overset{p}{\to} \sum_{u} Q_{c, u} \end{matrix}$
Therefore, for node2vec we have:
$\frac{n (w, c) \times | D |}{n (w) \times n (c)} \overset{p}{\to} \frac{\frac{1}{2 T} \sum_{r = 1}^{T} (\sum_{u} Q_{w, u} P_{c, w, u}^{r} + \sum_{u} Q_{c, u} P_{w, c, u}^{r})}{(\sum_{u} Q_{w, u}) \times (\sum_{u} Q_{c, u})}$
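Although this gives a closed form for node2vec, its formulation in matrix form is left for future work.
Computing and storing the transition tensor $\mathbf P^r$ and the stationary distribution $\mathbf Q$ is very expensive, which makes it hard to model the full second-order dynamics. Some recent work applies a low-rank factorization to $\mathbf Q$ to reduce the time and space complexity:
$Q_{u, v} = {\vec{q}}_{u} \cdot {\vec{q}}_{v}$
Due to space limits, we mainly focus on the first-order random walk framework (DeepWalk).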

1.1.5 NetMF

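Based on the analysis above, LINE, PTE, DeepWalk, and node2vec are all unified into a matrix factorization framework. Here we focus on the DeepWalk matrix factorization, since it is more general than LINE and computationally cheaper than node2vec.
We first cite four additional theorems:
- Theorem 4: define the normalized graph Laplacian as $\mathbf L = \mathbf I - \mathbf D^{-1/2}\mathbf A \mathbf D^{-1/2}$; then all of its eigenvalues are real.
  Moreover, with the eigenvalues sorted in decreasing order, we have:
  $2 \geq λ_{1} \geq λ_{2} \geq \dots \geq λ_{n} = 0$
  Furthermore, if the graph is connected and $n\gt 1$, then $\lambda_1\ge \frac{n}{n-1}$.
  Proof: see "Spectral graph theory".
- Theorem 5: the singular values of a real symmetric matrix are the absolute values of its eigenvalues.
  Proof: see "Numerical linear algebra".
- Theorem 6: let $\mathbf B\in \mathbb R^{n\times n} ,\mathbf C \in \mathbb R^{n\times n}$, with the singular values of $\mathbf B,\mathbf C,\mathbf {BC}$ sorted in decreasing order. Then for any $1\le i,j\le n,i+j\le n+1$, the following inequality holds: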
  $σ_{i + j - 1} (B C) \leq σ_{i} (B) \times σ_{j} (C)$
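  where $\sigma_i(\cdot)$ denotes the $i$-th largest singular value.
  Proof: see "Topics in Matrix Analysis".
- Theorem 7: for a real symmetric matrix $\mathbf A$, define its Rayleigh quotient as: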
  $R (A, \vec{x}) = \frac{{\vec{x}}^{⊤} A \vec{x}}{{\vec{x}}^{⊤} \vec{x}}$
  Let the eigenvalues of $\mathbf A$ be $\lambda_1\ge\cdots\ge\lambda_n$; then:
  $λ_{n} = \min_{\vec{x} \neq \vec{0}} R (A, \vec{x}), λ_{1} = \max_{\vec{x} \neq \vec{0}} R (A, \vec{x})$
  Proof: see "Numerical linear algebra".
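Consider the matrix factorization of DeepWalk:
$X Y^{⊤} = \log (\frac{{vol}_{G}}{T} (\sum_{r = 1}^{T} P^{r}) D^{- 1}) - \log b$
Ignoring the constants and the element-wise $\log$, we focus on the matrix $\frac{1}{T}\left(\sum_{r=1}^T \mathbf P^r\right)\mathbf D^{-1}$.
The real symmetric matrix $\mathbf D^{-1/2}\mathbf A\mathbf D^{-1/2} = \mathbf I - \mathbf L$ has the eigendecomposition $\mathbf U\mathbf\Lambda \mathbf U^\top$, where $\mathbf U$ is an orthonormal matrix and $\mathbf\Lambda=\text{diag}(\lambda_1,\cdots,\lambda_n)$ (from here on, $\lambda_i$ denotes an eigenvalue of $\mathbf I - \mathbf L$ rather than of $\mathbf L$). By Theorem 4, $1=\lambda_1\ge \lambda_2\cdots\ge\lambda_n\ge -1$; moreover, if the graph is connected and $n\gt 1$, then $\lambda_n\le -\frac{1}{n-1}\lt0$.
Since $\mathbf P = \mathbf D^{-1}\mathbf A= \mathbf D^{-1/2}\left(\mathbf D^{-1/2}\mathbf A\mathbf D^{-1/2}\right)\mathbf D^{1/2} = \mathbf D^{-1/2}\mathbf U\mathbf\Lambda\mathbf U^\top\mathbf D^{1/2}$, we have $\mathbf P^r = \mathbf D^{-1/2}\mathbf U\mathbf\Lambda^r\mathbf U^\top\mathbf D^{1/2}$, and therefore: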
$\frac{1}{T} (\sum_{r = 1}^{T} P^{r}) D^{- 1} = (D^{- 1 / 2}) (U (\frac{1}{T} \sum_{r = 1}^{T} Λ^{r}) U^{⊤}) (D^{- 1 / 2})$
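- First consider the spectrum of $\mathbf U\left(\frac{1}{T}\sum_{r=1}^T \mathbf\Lambda^r\right) \mathbf U^\top$. Clearly its eigenvalues are:
  $\{\frac{1}{T} \sum_{r = 1}^{T} λ_{1}^{r}, \frac{1}{T} \sum_{r = 1}^{T} λ_{2}^{r}, \dots, \frac{1}{T} \sum_{r = 1}^{T} λ_{i}^{r}, \dots, \frac{1}{T} \sum_{r = 1}^{T} λ_{n}^{r}\}$
  This can be viewed as mapping each eigenvalue $\lambda_i$ of $\mathbf D^{-1/2}\mathbf A\mathbf D^{-1/2}$ through the polynomial $f(x) = \frac 1T\sum_{r=1}^T x^r$. The mapping acts as a filter, whose effect is shown in the figure below. We can see:
  - The filter tends to keep large positive eigenvalues.
  - As $T$ increases, this preference becomes more pronounced.
  In other words, as $T$ increases, the filter tries to approximate a low-rank positive semi-definite matrix by keeping the large positive eigenvalues.
- Next consider the spectrum of $\frac{1}{T}\left(\sum_{r=1}^T \mathbf P^r\right)\mathbf D^{-1}$.
  By Theorem 5, the singular values of $\mathbf U\left(\frac{1}{T}\sum_{r=1}^T \mathbf\Lambda^r\right) \mathbf U^\top$ are $|\frac 1T \sum_{r=1}^T \lambda_i^r|,i=1,2,\cdots,n$. Sorting them in decreasing order, with $\{p_1,p_2,\cdots,p_n\}$ the resulting permutation of indices, gives:
  $| \frac{1}{T} \sum_{r = 1}^{T} λ_{p_{1}}^{r} | \geq | \frac{1}{T} \sum_{r = 1}^{T} λ_{p_{2}}^{r} | \geq \dots \geq | \frac{1}{T} \sum_{r = 1}^{T} λ_{p_{n}}^{r} |$
  Since every $d_i$ is positive, the singular values of $\mathbf D^{-1/2}$ are $\frac{1}{\sqrt{d_i}}$. Sorting them in decreasing order, with $\{q_1,q_2,\cdots,q_n\}$ the resulting permutation of indices, gives:
  $\frac{1}{\sqrt{d_{q_{1}}}} \geq \frac{1}{\sqrt{d_{q_{2}}}} \geq \dots \geq \frac{1}{\sqrt{d_{q_{n}}}}$
  In particular, $d_{q_1} = d_{\min}$ is the smallest degree.
  Applying Theorem 6 twice, the $s$-th singular value satisfies: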
  $\begin{matrix} σ_{s} ((\frac{1}{T} \sum_{r = 1}^{T} P^{r}) D^{- 1}) \leq σ_{1} (D^{- 1 / 2}) σ_{s} (U (\frac{1}{T} \sum_{r = 1}^{T} Λ^{r}) U^{⊤}) σ_{1} (D^{- 1 / 2}) \\ = \frac{1}{\sqrt{d_{q_{1}}}} | \frac{1}{T} \sum_{r = 1}^{T} λ_{p_{s}}^{r} | \frac{1}{\sqrt{d_{q_{1}}}} = \frac{1}{d_{min}} | \frac{1}{T} \sum_{r = 1}^{T} λ_{p_{s}}^{r} | \end{matrix}$
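  That is, the $s$-th singular value of $\frac{1}{T}\left(\sum_{r=1}^T \mathbf P^r\right)\mathbf D^{-1}$ is bounded above by $\frac{1}{d_{\min}}\left|\frac 1T \sum_{r=1}^T\lambda_{p_s}^r\right|$.
  In addition, by the Rayleigh quotient, we have:
  $\begin{matrix} R ((\frac{1}{T} \sum_{r = 1}^{T} P^{r}) D^{- 1}, \vec{x}) = R (U (\frac{1}{T} \sum_{r = 1}^{T} Λ^{r}) U^{⊤}, D^{- 1 / 2} \vec{x}) R (D^{- 1}, \vec{x}) \\ \geq λ_{min} (U (\frac{1}{T} \sum_{r = 1}^{T} Λ^{r}) U^{⊤}) λ_{max} (D^{- 1}) \\ = \frac{1}{d_{min}} λ_{min} (U (\frac{1}{T} \sum_{r = 1}^{T} Λ^{r}) U^{⊤}) \end{matrix}$
  The inequality uses the fact that the smallest eigenvalue of the filtered matrix is non-positive (it is at most $f(\lambda_n)\le 0$ because $\lambda_n \lt 0$), so the product is smallest when $R(\mathbf D^{-1},\vec x)$ takes its largest value $\lambda_{\max}(\mathbf D^{-1}) = 1/d_{\min}$.
  Applying Theorem 7, we have: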
  $λ_{min} ((\frac{1}{T} \sum_{r = 1}^{T} P^{r}) D^{- 1}) \geq \frac{1}{d_{min}} λ_{min} (U (\frac{1}{T} \sum_{r = 1}^{T} Λ^{r}) U^{⊤})$
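The filter and the singular-value bound can be checked numerically on a small graph. The sketch below uses hypothetical names and a toy adjacency matrix: it evaluates $f(x) = \frac 1T\sum_{r=1}^T x^r$ on the eigenvalues of $\mathbf D^{-1/2}\mathbf A\mathbf D^{-1/2}$ and compares the sorted singular values of $\frac{1}{T}\left(\sum_{r=1}^T \mathbf P^r\right)\mathbf D^{-1}$ with the bound $\frac{1}{d_{\min}}\left|f(\lambda_{p_s})\right|$.

```python
import numpy as np


def walk_filter(x, T):
    """f(x) = (1/T) * sum_{r=1..T} x^r, applied element-wise."""
    x = np.asarray(x, dtype=float)
    return sum(x ** r for r in range(1, T + 1)) / T


# Hypothetical connected, non-bipartite toy graph.
A = np.array([[0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0]])
T = 10
d = A.sum(axis=1)

# Eigenvalues of the normalized adjacency D^-1/2 A D^-1/2, then filtered.
lam = np.linalg.eigvalsh(A / np.sqrt(np.outer(d, d)))
filtered = walk_filter(lam, T)

# Singular values (descending) of (1/T * sum_r P^r) D^-1.
P = A / d[:, None]
S = sum(np.linalg.matrix_power(P, r) for r in range(1, T + 1)) / T
singular_values = np.linalg.svd(S / d[None, :], compute_uv=False)

# Upper bound from the text: |f(lambda_{p_s})| / d_min, sorted in decreasing order.
bound = np.sort(np.abs(filtered))[::-1] / d.min()
```

Negative eigenvalues are strongly damped: for even $T$, $f(-1) = 0$ while $f(1) = 1$, which is the sense in which the filter favours large positive eigenvalues.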
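To illustrate the effect of the filter $f(x) = \frac 1T\sum_{r=1}^T x^r$, we analyze the Cora network: we take its largest connected component and examine the eigenvalues of $\mathbf D^{-1/2}\mathbf A \mathbf D^{-1/2}$, $\mathbf U\left(\frac {1}{T} \sum_{r=1}^T \mathbf\Lambda^r\right)\mathbf U^\top$, and $\left(\frac 1T \sum_{r=1}^T \mathbf P^r\right)\mathbf D^{-1}$, with $T=10$.
- For $\mathbf D^{-1/2}\mathbf A \mathbf D^{-1/2}$: the largest eigenvalue is $\lambda_1= 1$ and the smallest is $\lambda_n = -0.971$.
- For $\mathbf U\left(\frac {1}{T} \sum_{r=1}^T \mathbf\Lambda^r\right)\mathbf U^\top$, we find that all of its negative eigenvalues and some of its small positive eigenvalues are filtered out.
- For $\left(\frac 1T \sum_{r=1}^T \mathbf P^r\right)\mathbf D^{-1}$, we find that:
  - its singular values are bounded by the singular values of $\mathbf U\left(\frac {1}{T} \sum_{r=1}^T \mathbf\Lambda^r\right)\mathbf U^\top$;
  - its smallest eigenvalue is bounded by the eigenvalues of $\mathbf U\left(\frac {1}{T} \sum_{r=1}^T \mathbf\Lambda^r\right)\mathbf U^\top$.
Based on the theoretical analysis above, we propose a matrix factorization framework, NetMF, which improves on DeepWalk and LINE.
For convenience, we define: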
$M = \frac{{vol}_{G}}{b T} (\sum_{r = 1}^{T} P^{r}) D^{- 1}$
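Then $\mathbf X \mathbf Y^\top = \log \mathbf M$ is the matrix factorization corresponding to DeepWalk.
- For small $T$: computing $\mathbf M$ directly is feasible, but factorizing $\log \mathbf M$ directly is problematic. First, $\log \mathbf M$ is ill-defined wherever $M_{i,j} = 0$, since $\log M_{i,j}$ is $-\infty$. Second, $\log \mathbf M$ is a huge dense matrix, so the computational cost is too high.
  Inspired by Shifted PPMI, we define $\mathbf M^\prime = \max(\mathbf M,1)$ (element-wise), so that every entry of $\log \mathbf M^\prime$ is non-negative and $\log \mathbf M^\prime$ is sparse. We then factorize $\log \mathbf M^\prime$ by SVD and use its top $d$ singular values and singular vectors to construct the embedding vectors.
- For large $T$: computing $\mathbf M$ directly is too expensive, so we approximate $\mathbf M$ instead of computing it exactly.
  - First, compute the eigendecomposition of $\mathbf D^{-1/2}\mathbf A\mathbf D^{-1/2}$ restricted to its top $h$ eigenpairs, $\mathbf U_h\mathbf\Lambda_h\mathbf U_h^{\top}$. Since only the top $h$ eigenvalues of $\mathbf D^{-1/2}\mathbf A\mathbf D^{-1/2}$ are used and the matrix involved is sparse, we can use the Arnoldi method to greatly reduce the time.
    Eigendecomposition of a large dense matrix is too expensive to be practical.
  - Then approximate $\mathbf M$ by $\hat {\mathbf M} = \frac{\text{vol}_G}{b }\mathbf D ^{-1/2}\mathbf U_h\left(\frac 1T\sum_{r=1}^T\mathbf \Lambda_h^r\right) \mathbf U_h^\top\mathbf D^{-1/2}$.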
The NetMF algorithm:
- Input:
  - the graph $G(V,E,\mathbf A)$
  - the window size $T$
- Output: the vertex embedding matrix $\mathbf X$
- Steps:
  - If $T$ is small, compute:
    $\begin{matrix} P^{1}, \dots, P^{T} \\ M = \frac{{vol}_{G}}{b T} (\sum_{r = 1}^{T} P^{r}) D^{- 1} \\ M^{'} = \max (M, 1) \end{matrix}$
    If $T$ is large, perform the eigendecomposition $\mathbf D^{-1/2}\mathbf A \mathbf D^{-1/2} \simeq \mathbf U_h\mathbf\Lambda_h\mathbf U_h^\top$, and then compute:
    $\begin{matrix} \hat{M} = \frac{{vol}_{G}}{b} D^{- 1 / 2} U_{h} (\frac{1}{T} \sum_{r = 1}^{T} Λ_{h}^{r}) U_{h}^{⊤} D^{- 1 / 2} \\ {\hat{M}}^{'} = \max (\hat{M}, 1) \end{matrix}$
  - Compute the rank-$d$ approximation by SVD: $\log \mathbf M^\prime = \mathbf U_d \Sigma_d \mathbf V_d^\top$ (small $T$) or $\log \hat{\mathbf M}^\prime = \mathbf U_d \Sigma_d \mathbf V_d^\top$ (large $T$).
  - Return $\mathbf U_d\sqrt\Sigma_d$ as the network embedding.
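The algorithm above can be sketched with dense NumPy operations. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the adjacency matrix is dense, and a full eigendecomposition via `np.linalg.eigh` stands in for the sparse Arnoldi-type solver that the text recommends for large graphs.

```python
import numpy as np


def netmf_exact_matrix(A, T, b):
    """M = vol_G / (b T) * (sum_{r=1..T} P^r) D^-1, for small T."""
    d = A.sum(axis=1)
    P = A / d[:, None]
    P_r = np.eye(A.shape[0])
    S = np.zeros_like(A, dtype=float)
    for _ in range(T):
        P_r = P_r @ P
        S += P_r
    return d.sum() / (b * T) * S / d[None, :]


def netmf_approx_matrix(A, T, b, h):
    """M_hat built from the top-h eigenpairs of D^-1/2 A D^-1/2, for large T."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    lam, U = np.linalg.eigh(d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :])
    top = np.argsort(lam)[::-1][:h]            # indices of the h largest eigenvalues
    lam_h, U_h = lam[top], U[:, top]
    filt = sum(lam_h ** r for r in range(1, T + 1)) / T
    left = d_inv_sqrt[:, None] * U_h           # D^-1/2 U_h
    return d.sum() / b * (left * filt[None, :]) @ left.T


def netmf_embedding(M, dim):
    """Factorize log(max(M, 1)) by SVD and return U_d * sqrt(Sigma_d)."""
    log_m = np.log(np.maximum(M, 1.0))         # element-wise truncation, then log
    U, s, _ = np.linalg.svd(log_m)
    return U[:, :dim] * np.sqrt(s[:dim])


# Hypothetical connected, non-bipartite toy graph.
A = np.array([[0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0]])
M = netmf_exact_matrix(A, T=3, b=1.0)
M_hat_full = netmf_approx_matrix(A, T=3, b=1.0, h=4)   # h = n keeps everything
M_hat_low = netmf_approx_matrix(A, T=3, b=1.0, h=2)
X = netmf_embedding(M, dim=2)
```

When $h = n$ no eigenpair is discarded, so $\hat{\mathbf M}$ coincides with $\mathbf M$; smaller $h$ trades accuracy for cost.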
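For large $T$, $\hat{\mathbf M}$ approximates $\mathbf M$. The following theorem gives upper bounds on the error of approximating $\mathbf M$ by $\hat{\mathbf M}$ and $\log \mathbf M^\prime$ by $\log \hat{\mathbf M}^\prime$.
Theorem 8: let $||\cdot||_F$ be the Frobenius norm of a matrix; then:
$\begin{matrix} {‖ M - \hat{M} ‖}_{F} \leq \frac{{vol}_{G}}{b d_{min}} \sqrt{\sum_{j = h + 1}^{n} | \frac{1}{T} \sum_{r = 1}^{T} λ_{j}^{r} |^{2}} \\ {‖ \log M^{'} - \log {\hat{M}}^{'} ‖}_{F} \leq {‖ M^{'} - {\hat{M}}^{'} ‖}_{F} \leq {‖ M - \hat{M} ‖}_{F} \end{matrix}$
Proof:
- First inequality: it follows from the definition of the Frobenius norm and Theorem 7 above.
- Second inequality: without loss of generality, assume $M^\prime_{i,j}\le \hat M_{i,j}^\prime$; then: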
  $\begin{matrix} | \log M_{i, j}^{'} - \log {\hat{M}}_{i, j}^{'} | = \log (1 + \frac{{\hat{M}}_{i, j}^{'} - M_{i, j}^{'}}{M_{i, j}^{'}}) \\ \leq \frac{{\hat{M}}_{i, j}^{'} - M_{i, j}^{'}}{M_{i, j}^{'}} \leq {\hat{M}}_{i, j}^{'} - M_{i, j}^{'} = | {\hat{M}}_{i, j}^{'} - M_{i, j}^{'} | \end{matrix}$
  The first step uses $\log (1+x)\le x$ for $x\ge 0$, and the second uses $M_{i,j}^\prime = \max(M_{i,j},1) \ge 1$. Hence $\left\|\log \mathbf M^\prime - \log \hat{\mathbf M}^\prime\right\|_F \le \left\|\mathbf M^\prime - \hat{\mathbf M}^\prime\right\|_F$.
  Third inequality: by the definitions of $\mathbf M^\prime$ and $\hat{\mathbf M}^\prime$, we have:
  $| M_{i, j}^{'} - {\hat{M}}_{i, j}^{'} | = | max (M_{i, j}, 1) - max ({\hat{M}}_{i, j}, 1) | \leq | M_{i, j} - {\hat{M}}_{i, j} |$
  Hence $\left\|\mathbf M^\prime - \hat{\mathbf M}^\prime\right\|_F \le \left\|\mathbf M - \hat{\mathbf M}\right\|_F$.
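The last two inequalities hold entry-wise for any pair of real matrices, so they can be checked on arbitrary inputs. The sketch below uses randomly generated stand-ins for $\mathbf M$ and $\hat{\mathbf M}$ (hypothetical data, not NetMF matrices) and compares the three Frobenius norms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for M and its approximation M_hat (illustrative data only).
M = rng.uniform(0.0, 5.0, size=(6, 6))
M_hat = M + rng.normal(0.0, 0.5, size=(6, 6))

# Element-wise truncation at 1, as in M' = max(M, 1).
M_trunc = np.maximum(M, 1.0)
M_hat_trunc = np.maximum(M_hat, 1.0)

err_log = np.linalg.norm(np.log(M_trunc) - np.log(M_hat_trunc), ord="fro")
err_trunc = np.linalg.norm(M_trunc - M_hat_trunc, ord="fro")
err_raw = np.linalg.norm(M - M_hat, ord="fro")
```

Both steps rest on 1-Lipschitz maps: $x \mapsto \max(x, 1)$ on the real line, and $\log$ on $[1, \infty)$.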

1.2 Experiments

Here we evaluate NetMF on the multi-label vertex classification task, which was also used in the DeepWalk, LINE, and node2vec papers.
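Datasets:
- BlogCatalog: a social network of online bloggers; labels represent the bloggers' interests.
- Flickr: a network of relationships between users of the Flickr website; labels represent users' interest groups, such as "black and white photos".
- Protein-Protein Interactions (PPI): a subset of the PPI network for Homo sapiens; labels are derived from hallmark gene sets and represent biological states.
- Wikipedia: a co-occurrence network of the words appearing in the first million bytes of the English Wikipedia dump. Vertex labels are the part-of-speech (POS) tags inferred by the Stanford POS-Tagger.
The statistics of these datasets are shown in the table below.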
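Baselines and configuration: we compare NetMF(T=1) and NetMF(T=10) with LINE(2nd) and DeepWalk.
- All models use an embedding dimension of 128.
- For NetMF(T=10), we set $h=16384$ on Flickr and $h=256$ on the other datasets.
- For DeepWalk, we use a window size of 10, a walk length of 40, and 80 walks starting from each vertex.
We focus on comparing NetMF(T=1) with LINE(2nd), since both have window size 1, and NetMF(T=10) with DeepWalk, since both have window size 10.
Following the same experimental procedure as DeepWalk, we first train embeddings on the whole network, then randomly sample a portion of the labeled vertices to train a one-vs-rest logistic regression classifier, and use the remaining vertices as the test set. At test time, the one-vs-rest model produces a ranking of labels rather than a final label assignment. To avoid thresholding effects, we assume that the number of labels of each test vertex is given.
For BlogCatalog, PPI, and Wikipedia, we vary the training ratio from 10% to 90%; for Flickr, we vary it from 1% to 10%. We report Micro-F1 and Macro-F1 on the test set. To make the results reliable, we repeat each configuration 10 times and report the mean of the test metrics.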
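The evaluation protocol (rank labels, keep as many as the known label count, then score with Micro-F1 and Macro-F1) can be sketched as follows. The function names and the tiny score matrix are hypothetical; in the experiments the scores would come from the one-vs-rest logistic regression, which is not reproduced here.

```python
import numpy as np


def predict_top_k(scores, k_per_vertex):
    """For each vertex, mark its k highest-scoring labels, where k is the known label count."""
    pred = np.zeros(scores.shape, dtype=bool)
    for i, k in enumerate(k_per_vertex):
        if k > 0:
            top = np.argsort(scores[i])[::-1][:k]
            pred[i, top] = True
    return pred


def micro_macro_f1(y_true, y_pred):
    """Micro-F1 pools counts over labels; Macro-F1 averages the per-label F1 scores."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = (y_true & y_pred).sum(axis=0).astype(float)
    fp = (~y_true & y_pred).sum(axis=0).astype(float)
    fn = (y_true & ~y_pred).sum(axis=0).astype(float)
    denom = 2 * tp + fp + fn
    # Labels with no true and no predicted positives get F1 = 0 by convention here.
    per_label = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    micro_denom = 2 * tp.sum() + fp.sum() + fn.sum()
    micro = 2 * tp.sum() / micro_denom if micro_denom > 0 else 0.0
    return micro, per_label.mean()


# Hypothetical test set: 2 vertices, 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]], dtype=bool)
scores = np.array([[0.9, 0.8, 0.1],
                   [0.2, 0.7, 0.3]])
y_pred = predict_top_k(scores, y_true.sum(axis=1))
micro, macro = micro_macro_f1(y_true, y_pred)
```

In this toy case one of the first vertex's two labels is ranked too low, so Micro-F1 is $2/3$ and Macro-F1 is $5/9$.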
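The full experimental results are shown in the figure below. NetMF(T=1) improves over LINE(2nd) (both use context window T=1), and NetMF(T=10) improves over DeepWalk (both use context window T=10).
- On BlogCatalog, PPI, and Flickr, the proposed NetMF(T=10) outperforms the baselines, which demonstrates the effectiveness of the proposed theoretical foundation for network embedding.
- On Wikipedia, the smaller-window NetMF(T=1) and LINE(2nd) perform better. This suggests that short-range dependencies are sufficient to model the Wikipedia network structure, because it is a dense word co-occurrence network with an average degree of 77.11, in which a large number of word pairs co-occur.
- As shown in the table below, in most cases NetMF substantially outperforms DeepWalk and LINE when labeled data is sparse.
- DeepWalk uses random-walk sampling to approximate the true vertex-context distribution with an empirical distribution. Although the law of large numbers guarantees convergence, real-world networks are large and the scale of random walks in practice is limited (walk length and number of walks), so there is a gap between the empirical and the true distribution, which hurts DeepWalk's performance.
  NetMF models the true vertex-context distribution directly and therefore achieves better results than DeepWalk.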